-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HealthCheck log output options #23900
Conversation
331ce42
to
ebadb8b
Compare
bf7e1cb
to
11b35a5
Compare
244c8a9
to
eb80e95
Compare
eb80e95
to
9d1cc56
Compare
libpod/events/journal_linux.go
Outdated
if ee.HealthLog != "" { | ||
m["PODMAN_HEALTH_LOG"] = ee.HealthLog | ||
} | ||
m["PODMAN_HEALTH_FAILING_STREAK"] = strconv.Itoa(ee.HealthFailingStreak) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only write this when we actually write a hc event.
Right now all events would get PODMAN_HEALTH_FAILING_STREAK set to 0 which doesn't make sense, I guess the same issue applies to PODMAN_HEALTH_STATUS
libpod/define/container_inspect.go
Outdated
// HealthLogDestination defines the destination where the log is stored | ||
HealthLogDestination string `json:"HealthLogDestination,omitempty" default:"local"` | ||
// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file. | ||
// ('0' value means an infinite number of attempts in the log file) | ||
HealthMaxLogCount uint `json:"HealthcheckMaxLogCount,omitempty" default:"5"` | ||
// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log | ||
// ("0" value means an infinite log length) | ||
HealthMaxLogSize uint `json:"HealthcheckMaxLogSize,omitempty" default:"500"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What uses the default
key? AFAICT we never used this syntax anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it would force a default value. For example, instead of 0 to 5. But it's not. I'll remove it.
libpod/container_config.go
Outdated
// HealthLogDestination defines the destination where the log is stored | ||
HealthLogDestination string `json:"healthLogDestination,omitempty" default:"local"` | ||
// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file. | ||
// ('0' value means an infinite number of attempts in the log file) | ||
HealthMaxLogCount uint `json:"healthMaxLogCount,omitempty" default:"5"` | ||
// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log | ||
// ("0" value means an infinite log length) | ||
HealthMaxLogSize uint `json:"healthMaxLogSize,omitempty" default:"500"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here default key, this is a duplication if we ever change the default
} else { | ||
options = append(options, libpod.WithHealthCheckLogDestination(define.DefaultHealthCheckLocalDestination)) | ||
options = append(options, libpod.WithHealthCheckMaxLogCount(define.DefaultHealthMaxLogCount)) | ||
options = append(options, libpod.WithHealthCheckMaxLogSize(define.DefaultHealthMaxLogSize)) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing, I think it is better when defaults are set in libpod on not by specgen.
pkg/specgen/specgen.go
Outdated
// HealthLogDestination defines the destination where the log is stored | ||
HealthLogDestination string `json:"healthLogDestination,omitempty" default:"local"` | ||
// HealthMaxLogCount is maximum number of attempts in the HealthCheck log file. | ||
// ('0' value means an infinite number of attempts in the log file) | ||
HealthMaxLogCount uint `json:"healthMaxLogCount,omitempty" default:"5"` | ||
// HealthMaxLogSize is the maximum length in characters of stored HealthCheck log | ||
// ("0" value means an infinite log length) | ||
HealthMaxLogSize uint `json:"healthMaxLogSize,omitempty" default:"500"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here with the default duplciation.
test/system/220-healthcheck.bats
Outdated
count=$(grep -o "$msg" <<< "$output" | wc -l) | ||
assert "$count" -eq 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this produces nice errors either but I leave to @edsantiago to judge.
test/system/220-healthcheck.bats
Outdated
@@ -273,4 +273,215 @@ Log[-1].Output | \"Uh-oh on stdout!\\\nUh-oh on stderr!\\\n\" | |||
done | |||
} | |||
|
|||
@test "podman healthcheck --health-max-log-count default value (5)" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seem there is rather significant duplication in what is being tested in e2e and system tests. We should not have duplicate tests as this just wastes CI time. One of them is good enough. I think using system tests make sense here as they run in gating.
And it is extremely difficult for me to figure out if these tests are indeed the same or if there are some minor differences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests test the same functionality. I will remove the e2e tests.
test/system/220-healthcheck.bats
Outdated
is "$output" "" "output from 'podman healthcheck run'" | ||
done | ||
|
||
# sys podman tests executes healthcheck twice the first time healthcheck is run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct, you are running them twice... The healthcheck is trigger by podman when the container is started but because you manually run_podman healthcheck run you run it twice not podman.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Way too much duplication in the tests. Can you find a way to refactor? Also a few other suggestions for corrections.
I'll take another look once you deduplicate.
test/system/220-healthcheck.bats
Outdated
@@ -273,4 +273,215 @@ Log[-1].Output | \"Uh-oh on stdout!\\\nUh-oh on stderr!\\\n\" | |||
done | |||
} | |||
|
|||
@test "podman healthcheck --health-max-log-count default value (5)" { | |||
local repeat_count=10 | |||
local msg="Hello, How are you?" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good practice to use randomized strings:
local msg="healthmsg-$(random_string)"
This prevents false negatives from caching, or copypasta, or various other types of logic errors. The "healthmsg" prefix will make it very easy for a future maintainer, looking at test failures, to see what type of string this is and to grep the code for it.
test/system/220-healthcheck.bats
Outdated
cid="$output" | ||
|
||
run_podman inspect $ctrname --format "{{.Config.HealthMaxLogCount}}" | ||
is "$output" "5" "HealthMaxLogCount is set to 5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please drop the "is set to 5": on failure, the diagnostic message will already say "expected 5, got (something else)". And if the expected value ever changes, someone might change the "5" but forget to change the "is set to 5", and the code will be impossible to understand.
Also, where does this magic 5 come from? Is it a hardcoded default in podman? Is it a containers.conf setting? (I could read the PR but this is way too huge right now). What will happen if/when that default changes or is overridden? Maybe the message should say "HealthMaxLogCount is the expected default", or some explanation of where it's coming from
test/system/220-healthcheck.bats
Outdated
for _ in $(seq 1 $repeat_count); | ||
do | ||
run_podman healthcheck run $ctrname | ||
is "$output" "" "output from 'podman healthcheck run'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: "unexpected output from podman healthcheck run (pass $i)"
(and use i
instead of _
). Any time there's a loop, it's nice to include the variables in the error message. Not super helpful in this particular case, but it's a good habit to get into.
test/system/220-healthcheck.bats
Outdated
|
||
|
||
run_podman inspect $ctrname --format "{{.State.Health.Log}}" | ||
count=$(grep -o "$msg" <<< "$output" | wc -l) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not grep -c
? As in, count=$(grep -co "$msg" <<<"$output")
test/system/220-healthcheck.bats
Outdated
|
||
run_podman inspect $ctrname --format "{{.State.Health.Log}}" | ||
count=$(grep -o "$msg" <<< "$output" | wc -l) | ||
assert "$count" -eq 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain the assertion, e.g. assert "$count" -eq 5 "Number of matching health log lines"
909bc6d
to
49f036d
Compare
@edsantiago @Luap99 PTAL, I've modified the code according to your feedback. |
Looking, but, test flakes rootless on my laptop:
|
This is caused by the fact that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice refactoring. A few suggestions and requests inline. As for the flake, I have no opinion yet, don't understand the code well enough to comment.
One more suggestion: could you please add comments in important places? This all makes perfect sense to you, because you're deeply immersed in the code today, but it is very hard for me to read, and one day will be very hard for you too. Simple comments (some suggestions below) can save a maintainer precious time.
Thank you!
test/system/220-healthcheck.bats
Outdated
local ctrname="c-h-$(safename)" | ||
_create_container_with_health_log_settings $ctrname $msg "{{.Config.HealthMaxLogCount}}" "--health-max-log-count $repeat_count" "$repeat_count" "HealthMaxLogCount" | ||
|
||
# This is run 11 times to check that the cap is working. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: is run one more time than max-count
. And instead of 0 $repeat_count
, which is too subtle for a quick glance, $(seq 1 $repeat_count) plusone
test/system/220-healthcheck.bats
Outdated
run_podman healthcheck run $ctrname | ||
is "$output" "" "output from 'podman healthcheck run'" | ||
|
||
_check_health_log $ctrname "healthmsg-}]$" 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: substr=${msg:0:10}
then _check ... "$substr}]\$"
. With an explanation that max-log-size
truncates the echo message, because there are just way too many options and way too many magic "10"s all over, and it's confusing...
test/system/220-healthcheck.bats
Outdated
|
||
|
||
@test "podman healthcheck --health-log-destination file" { | ||
local TMP_DIR_HEALTHCHECK="$BATS_FILE_TMPDIR/healthcheck" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use $PODMAN_TMPDIR
. That's unique to each test.
test/system/220-healthcheck.bats
Outdated
run cat $healthcheck_log_path | ||
count=$(grep -co "$msg" <<< "$output") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: the run cat
is unnecessary, you can just count=$(grep -co "$msg" $healthcheck_log_path)
test/system/220-healthcheck.bats
Outdated
|
||
# The healthcheck is triggered by the podman when the container is started, but its execution depends on systemd. | ||
# And since `run_podman healthcheck run` is also run manually, it will result in two runs. | ||
run journalctl --output cat --output-fields=PODMAN_HEALTH_LOG PODMAN_ID=$cid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bats run
is evil. Or at least fiendishly tricky. It hides a lot of important stuff. Suggestion:
cmd="journalctl --output cat ...etc"
echo "$_LOG_PROMPT $cmd"
run $cmd
echo "$output"
assert "$status" -eq 0 "exit status of journalctl"
This is tedious and it's very unfortunate that we have to do things like this... but a future maintainer will greatly appreciate it when the test fails.
test/system/220-healthcheck.bats
Outdated
} | ||
|
||
@test "podman healthcheck --health-max-log-count default value (5)" { | ||
local repeat_count=10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: eliminate
test/system/220-healthcheck.bats
Outdated
local ctrname="c-h-$(safename)" | ||
_create_container_with_health_log_settings $ctrname $msg "{{.Config.HealthMaxLogCount}}" "" "5" "HealthMaxLogCount is the expected default" | ||
|
||
for i in $(seq 1 $repeat_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just hardcode 10, and add a comment above, something like "run many more times than max count; confirm that excess ones are not logged"
49f036d
to
465b01d
Compare
…nation flags These flags can affect the output of the HealtCheck log. Currently, when a container is configured with HealthCheck, the output from the HealthCheck command is only logged to the container status file, which is accessible via `podman inspect`. It is also limited to the last five executions and the first 500 characters per execution. This makes debugging past problems very difficult, since the only information available about the failure of the HealthCheck command is the generic `healthcheck service failed` record. - The `--health-log-destination` flag sets the destination of the HealthCheck log. - `none`: (default behavior) `HealthCheckResults` are stored in overlay containers. (For example: `$runroot/healthcheck.log`) - `directory`: creates a log file named `<container-ID>-healthcheck.log` with JSON `HealthCheckResults` in the specified directory. - `events_logger`: The log will be written with logging mechanism set by events_loggeri. It also saves the log to a default directory, for performance on a system with a large number of logs. - The `--health-max-log-count` flag sets the maximum number of attempts in the HealthCheck log file. - A value of `0` indicates an infinite number of attempts in the log file. - The default value is `5` attempts in the log file. - The `--health-max-log-size` flag sets the maximum length of the log stored. - A value of `0` indicates an infinite log length. - The default value is `500` log characters. Add --health-max-log-count flag Signed-off-by: Jan Rodák <[email protected]> Add --health-max-log-size flag Signed-off-by: Jan Rodák <[email protected]> Add --health-log-destination flag Signed-off-by: Jan Rodák <[email protected]>
465b01d
to
de856da
Compare
@edsantiago PTAL, I've modified the tests according to your suggestions. |
Ephemeral COPR build failed. @containers/packit-build please check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: edsantiago, Honny1, Luap99 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
4e38381
into
containers:main
This PR creates three new flags that can affect the output of the HealtCheck log.
Currently, when a container is configured with HealthCheck, the output from the HealthCheck command is only logged to the container status file, which is accessible via
podman inspect
. It is also limited to the last five executions and the first 500 characters per execution.This makes debugging past problems very difficult, since the only information available about the failure of the HealthCheck command is the generic
healthcheck service failed
record.The
--health-log-destination
flag sets the destination of the HealthCheck log.none
: (default behavior)HealthCheckResults
are stored in overlay containers. (For example:./run/containers/storage/overlay-containers/<container-ID>/healthcheck.log
)directory
: creates a log file named<container-ID>-healthcheck.log
with JSONHealthCheckResults
in the specified directory.events_logger
: The log will be written with logging mechanism set by events_logger.The
--health-max-log-count
flag sets the maximum number of attempts in the HealthCheck log file.0
indicates an infinite number of attempts in the log file.5
attempts in the log file.The
--health-max-log-size
flag sets the maximum length of the log stored.0
indicates an infinite log length.500
log characters.Does this PR introduce a user-facing change?
Fixes: RHEL-24623